Math Review

Chelsea Parlett-Pelleriti

Calculus + Linear Algebra

Derivatives

Derivative: Instantaneous rate of change

Note: We most often care about when derivatives are 0. Why?
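For example, for a simple quadratic the derivative is zero exactly at the minimum:

\[ f(x) = (x-2)^2, \quad f'(x) = 2(x-2), \quad f'(x) = 0 \implies x = 2 \]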

Integration

Integrals: Area under the curve \(\int_{-\infty}^{\infty} f(x) \, dx\)

Note: If the curve is a (proper) probability density, what is the total AUC?
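As a worked example of area under a density, take \(f(x) = 2x\) on \([0, 1]\):

\[ \int_0^1 2x \, dx = x^2 \Big|_0^1 = 1 \]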

Gradients

Gradients

\[f = x^2 + xy + y^2\]

How does \(f(x,y)\) change when \(x\) changes? When \(y\) changes?

\[\frac{\partial f}{\partial x} = 2x + y\]

\[\frac{\partial f}{\partial y} = x + 2y\]

Gradients

Shove those in a vector, and you’ve got a gradient.

\[ \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x + y \\ x + 2y \end{bmatrix} \]

Note: what do gradients tell us about a function?
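A quick base R sanity check of this gradient, comparing the analytic formula to a forward finite difference (the point \((1, 2)\) and the step size are arbitrary choices):

```r
# f and its analytic gradient, from the slides above
f <- function(v) v[1]^2 + v[1] * v[2] + v[2]^2
grad_f <- function(v) c(2 * v[1] + v[2], v[1] + 2 * v[2])

v <- c(1, 2)
h <- 1e-6
# forward finite differences in each coordinate
numeric_grad <- c((f(v + c(h, 0)) - f(v)) / h,
                  (f(v + c(0, h)) - f(v)) / h)

print(grad_f(v))               # [1] 4 5
print(round(numeric_grad, 3))  # [1] 4 5
```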

Reading a Contour Plot

Important Matrices

  • covariance
  • correlation
  • hessian
  • design matrix

Covariance and Correlation Matrices

Covariance: Do two variables move together? Correlation rescales covariance to \([-1, 1]\): \[Corr(x,y) = \frac{Cov(x,y)}{sd(x)sd(y)}\]
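A small base R sketch of this relationship with simulated data (the variables here are made up):

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

# correlation is covariance rescaled by the standard deviations
by_hand  <- cov(x, y) / (sd(x) * sd(y))
built_in <- cor(x, y)
print(all.equal(by_hand, built_in))  # TRUE
```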

Hessian

gradients: first derivatives of a multivariable function \(f(x,y)\)

hessians: second derivatives of a multivariable function \(f(x,y)\)

\[\begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \\ \end{bmatrix}\]

Note: What does a second derivative tell you in a single variable function \(f(x)\)?

Hessian

\[f = x^2 + xy + y^2\]

\[\frac{\partial f}{\partial x} = 2x + y; \frac{\partial f}{\partial y} = x + 2y\] \[\begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \\ \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ \end{bmatrix}\]
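What does this Hessian tell us? Its eigenvalues come from

\[ \det \begin{bmatrix} 2 - \lambda & 1 \\ 1 & 2 - \lambda \end{bmatrix} = (2-\lambda)^2 - 1 = 0 \implies \lambda = 1, 3 \]

Both eigenvalues are positive, so the Hessian is positive definite everywhere: \(f\) is convex, and its single critical point at \((0,0)\) is a minimum.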

Hessian

Where is \(f''(x) = 0\)? positive? negative?

Hessian

Design Matrix

a design matrix in a linear model is a matrix of all the explanatory variables, \(X\).

\[y = X\beta\]

  • simple case: columns of \(X\) are each vectors of predictor variables (e.g. height, age…)

  • complicated case: one-hot encoding, intercept, polynomial terms

Design Matrix

\[\underset{n\times (p+1)}{X} = \begin{bmatrix} 1 & 24 & 170 \\ 1 & 21 & 162 \\ ... & ... & ... \\ 1 & 22 & 150 \\ \end{bmatrix}; \underset{(p+1)\times 1}{\beta} = \begin{bmatrix}3.2 \\ 0.1 \\ 0.02 \end{bmatrix}\]

Determinants

the determinant of \(A\) is the (signed) area of the unit square after multiplying by \(A\)

\[ \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]

Determinants

the determinant of \(A\) is the (signed) area of the unit square after multiplying by \(A\)

\[ \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix} \]
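A quick base R check of both matrices on these slides:

```r
A <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)  # [[1, 0.5], [0.5, 1]]
B <- matrix(c(1, 0.5, 2, 1), nrow = 2)    # [[1, 2],   [0.5, 1]]

print(det(A))  # 0.75: the unit square becomes a parallelogram with area 0.75
print(det(B))  # 0: the columns are parallel, so the square collapses to a line
```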

Trace

trace: sum of the diagonal elements

\[ \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix} \]

\(tr \left ( \underset{n \times n}{A} \right ) = \sum_{i=1}^n \lambda_i\)
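A quick check of that formula with the matrix on this slide: the characteristic polynomial of \(\begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix}\) is

\[ (1-\lambda)^2 - (2)(0.5) = 0 \implies (1-\lambda)^2 = 1 \implies \lambda = 0, 2 \]

so \(\sum_i \lambda_i = 0 + 2 = 2\), matching the diagonal sum \(1 + 1 = 2\).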

Important Matrix Decompositions

  • Singular Value
  • Eigen
  • QR
  • Cholesky

Singular Value Decomposition

\(A = UDV^T\), where

  • \(U\) is an orthogonal matrix; its columns are eigenvectors of \(AA^T\)
  • \(D\) is a diagonal matrix; its diagonal holds the singular values, the square roots of the (shared) positive eigenvalues of \(AA^T\) and \(A^TA\)
  • \(V\) is an orthogonal matrix; its columns are eigenvectors of \(A^TA\)

Singular Value Decomposition

We can use SVD to do PCA by decomposing the centered data matrix \(X\) directly, rather than eigendecomposing \(X^TX\) (which is proportional to the covariance matrix).

If \(X = UDV^T\), \(V\) contains the eigenvectors needed to create PCs, and the squared singular values on the diagonal of \(D\) are proportional to the eigenvalues for each PC.
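A small base R sketch of that connection with simulated data: the squared singular values of the centered \(X\) equal the eigenvalues of \(X^TX\).

```r
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
X <- scale(X, center = TRUE, scale = FALSE)  # center each column

s <- svd(X)
e <- eigen(t(X) %*% X)

# both are sorted largest-first, so they line up entry by entry
print(round(s$d^2, 3))
print(round(e$values, 3))
```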

Singular Value Decomposition

source: Wikipedia

Eigendecomposition

\(A = P \Lambda P^{-1}\), where

  • \(P\) is a matrix of eigenvectors
  • \(\Lambda\) is a diagonal matrix of eigenvalues
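A quick base R sketch, rebuilding a matrix as \(P \Lambda P^{-1}\) from its eigenvectors and eigenvalues (using the symmetric \(2 \times 2\) example from these slides):

```r
A <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

ed <- eigen(A)
P <- ed$vectors          # eigenvectors in the columns
Lambda <- diag(ed$values)  # eigenvalues on the diagonal

A_rebuilt <- P %*% Lambda %*% solve(P)
print(round(A_rebuilt, 6))  # matches A
```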

Eigendecomposition

\(Ax\)

Eigendecomposition

\[ B = \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix}, A = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]

Eigendecomposition

  • Principal Component Analysis
  • Steady State of Transition Matrix (when \(\lambda = 1\))

\[ Ax = \lambda x \]

m <- matrix(c(1.0, 0.0,
              0.4, 0.6), nrow = 2)
m
     [,1] [,2]
[1,]    1  0.4
[2,]    0  0.6
eigen(m)$values
[1] 1.0 0.6
(eigen(m)$vectors[,1] )/sum(eigen(m)$vectors[,1])
[1] 1 0

QR Decomposition

  • \(Q\): An orthogonal matrix (\(Q^T = Q^{-1}\))
  • \(R\): An upper triangular matrix

QR Decomposition

\[ \begin{bmatrix} 1 & 3\\ 1 & -1 \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt2} & \frac{1}{\sqrt2}\\ \frac{1}{\sqrt2} & -\frac{1}{\sqrt2} \end{bmatrix} \begin{bmatrix} \sqrt2 & \sqrt2\\ 0 & 2\sqrt2 \end{bmatrix} \]

QR Decomposition

Find the least-squares solution to \(X\beta = y\): \(\hat\beta = \arg \underset{\beta}{\min}\left\| X\beta - y \right\|_2\)

Could do

\[ \left( X\beta - y\right)^T\left( X\beta - y\right) \\ \beta^TX^TX\beta - \beta^TX^Ty - y^TX\beta + y^Ty \\ \beta^TX^TX\beta- 2y^TX\beta + y^T y \\ \nabla_\beta [\beta^TX^TX\beta- 2 y^TX\beta + y^Ty] = 2X^TX\beta - 2X^T y \\ 2X^TX\beta - 2X^T y = 0 \\ X^TX\beta = X^Ty \\ \beta = \underbrace{\left(X^TX\right)^{-1}X^T}_\text{Moore-Penrose Inverse} y \]

But finding inverses SUCKS.

QR Decomposition

Substitute \(X = QR\) into the normal equations (using \(Q^TQ = I\)).

\[ X^TX\beta = X^Ty \to \left(QR \right)^T\left(QR \right)\beta = \left(QR \right)^Ty \\ R^TQ^TQR\beta = R^TQ^Ty \\ R^TR\beta = R^TQ^Ty \\ R\beta = Q^Ty \\ R\beta = v \]

Since \(R\) is upper triangular, it’s easy to solve with backsubstitution. No inverses!!
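A short base R sketch of this idea (the data here is simulated): `qr()` plus `backsolve()` recovers the least-squares coefficients without forming an inverse, matching `lm()`.

```r
set.seed(1)
x <- rnorm(50)
X <- cbind(1, x)                   # design matrix with an intercept column
y <- 2 + 3 * x + rnorm(50, sd = 0.1)

qr_X <- qr(X)
# solve R beta = Q^T y by back substitution -- no inverse needed
beta <- backsolve(qr.R(qr_X), crossprod(qr.Q(qr_X), y))

print(round(beta, 2))            # close to the true (2, 3)
print(round(coef(lm(y ~ x)), 2)) # same estimates from lm()
```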

(Sidetrack) Back Substitution

\[ x_1 + 2x_2 -x_3 = 3 \\ -2x_2 - 2x_3 = -10 \\ -6x_3 = -18 \] \[ A = \begin{bmatrix} 1 & 2 & -1 \\ 0 & -2 & -2 \\ 0 & 0 & -6 \end{bmatrix}, A\begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix}3 \\ -10 \\ -18 \end{bmatrix} \]
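Working bottom-up:

\[ -6x_3 = -18 \implies x_3 = 3 \\ -2x_2 - 2(3) = -10 \implies x_2 = 2 \\ x_1 + 2(2) - 3 = 3 \implies x_1 = 2 \]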

Cholesky Decomposition

For \(A\), a symmetric, positive-definite matrix, \(A = LL^T\), where

  • \(L\) is a lower triangular matrix
  • \(L^T\) is an upper triangular matrix, the transpose of \(L\)

Basically the Square Root of a Matrix (and a special case of \(LU\) decomp)
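A small worked example with a \(2 \times 2\) correlation matrix:

\[ \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 \\ 0.5 & \sqrt{0.75} \end{bmatrix}}_{L} \underbrace{\begin{bmatrix} 1 & 0.5 \\ 0 & \sqrt{0.75} \end{bmatrix}}_{L^T} \]

(Check the bottom-right entry: \(0.5^2 + 0.75 = 1\).)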

Cholesky Decomposition

Cholesky Decomp can force vectors of random numbers to have a specified covariance!

x <- rnorm(100000)
y <- rnorm(100000)
data <- rbind(x,y)
print(round(cor(x,y),3))
[1] 0.003
desired_corr <- matrix(c(1,0.5,
                         0.5,1), nrow = 2)
# returns Upper Triangular
L <- chol(desired_corr)
# multiply data by t(L)
Z <- t(L) %*% data
# correlated!
print(round(cor(Z[1,], Z[2,]),3))
[1] 0.503

Cholesky Decomposition

Proof

Remember, \(\Sigma = LL^T\) and \(Z = LX\)

\[ \mathbb{E}\left( ZZ^T\right) = \mathbb{E}\left( (LX)(LX)^T\right) = \\ \mathbb{E}\left( LXX^TL^T\right) = \\ L\mathbb{E}\left(XX^T\right)L^T = \\ LIL^T = LL^T = \Sigma \]

Taylor Series Expansion


How do we find this approximation?

\[ \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n \]

Taylor Series Expansion

How do we find this approximation?

\[ \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} (x)^n = \frac{f(0)}{0!} + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + ... \]

Taylor Series Expansion

Use? Truncating the series gives cheap local approximations; keeping terms through \(x^2\) is a second-order approximation:

\[ f(x) \approx \frac{f(0)}{0!} + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 \]
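A tiny base R sketch, comparing the quadratic approximation of \(e^x\) around 0 with the true value (the evaluation point 0.5 is an arbitrary choice):

```r
# second-order (quadratic) Taylor approximation of exp(x) around 0
taylor2 <- function(x) 1 + x + x^2 / 2

x <- 0.5
print(taylor2(x))  # 1.625
print(exp(x))      # 1.648721
```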